We are analyzing the quality of wine based on 11 predictors: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol. Quality, in this instance, is defined as an indicator of its craftsmanship and, thus, desirability which can be used, for example, in pricing. Quality does not necessarily indicate if a wine has gone bad. We will use two data sets – one for white wine and the other for red – to create two respective machine learning models. There are 4899 observations in the white wine data set and 1600 observations in the red wine data set.
We are interested in conducting analysis on both the white and red wine to assess the quality of wine in new wines in production so that they may be accurately priced and marketed.
This project uses data on the white and red wine, which records information of the chemical makeup of the wine.
white_og <- read.csv("Wine Dataset/winequality-white.csv", sep = ";")
red_og <- read.csv("Wine Dataset/winequality-red.csv", sep = ";")To clean our data, we clean the column names, change quality into a factor so we can analyze it with classification models, and add a type of White or Red to each data set. We removed any rows in the white wine data set with a quality value of 9 because there are too few instances, and thus, it inhibits our models from performing correctly later on. The red wine data set did not have any quality values of 9. We also create a combined data frame with both values from white and red wine to see if there are significant differences between red and white evaluations for quality.
In our data split, we put a proportion of .7 of each original data set into a training data set and a proportion of .3 into the testing data sets, stratifying by quality. In this section, we also folded our data into 5 folds for later cross validation use.
set.seed(1234)
white_split <- white %>%
initial_split(prop = 0.7, strata = "quality")
white_train <- training(white_split)
white_test <- testing(white_split)
red_split <- red %>%
initial_split(prop = 0.7, strata = "quality")
red_train <- training(red_split)
red_test <- testing(red_split)
white_fold <- vfold_cv(white_train, v = 5)
red_fold <- vfold_cv(red_train, v = 5)What sort of factors do winemakers and sommeliers look for in a quality wine? Generally, quality is determined by acidity, dryness, flavor profile or taste, alcohol content, and how well the wine is preserved or how it changes as it is stored. In our exploratory data analysis, we will analyze our predictors based on these five categories. First, acidity levels can be summarized through the ph levels, fixed.acidity, volatile.acidity, and citric.acid content. Dryness is determined by the density. Taste can be broken down into sweetness and saltiness, which are caused by residual.sugar and chlorides respectively. We will analyze alcohol content singularly to see its effect on the wine quality. Lastly, sulfurous compounds are what is generally used to preserve wine, so we will analyze free.sulfur.dioxide, total.sulfur.dioxide, and sulphates to see if the way a wine is preserved interacts with wine quality in an interesting way.
Our data can be split into two data sets because experts look for different levels of acidity, sugar, etc. for white wine and red wine. Thus, we will have 3 different representations of the data: one for white wine, one for red wine, and one for both.
All of our predictors are continuous, so we will use box plots, histograms, and scatter plots to visualize our data and determine feature selection.
First, let’s see the distribution of quality between both data sets of wine.
ggplot(combinedWine_og, aes(quality)) + geom_bar(color = "black", fill = "pink") + labs(title = "Histogram of Quality - Total Wine") + xlab("Quality of Wine") + ylab("Count") We can see that it is normally distributed, meaning that most wine has a quality value of 5 or 6, with few exceptionally good wines having a value of 8 or 9, and low quality wines having a quality value of 3. Based on their low frequency, we can further justify selecting against of quality values of 9 in our initial data cleaning.
Next, we look at the correlation matrices for white and red wine separately to determine which predictors are correlated.
white %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'lower', diag = FALSE,
method = 'color', main = 'White Wine Correlation Plot')red %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'lower', diag = FALSE,
method = 'color', main = 'Red Wine Correlation Plot')In the white wine correlation matrix, density and residual sugar; and density and alcohol are the predictors with the highest correlation. Total sulfur dioxide and free sulfur dioxide also have a moderate correlation.
In the red wine correlation matrix, citric acid and fixed acidity; density and fixed acidity; citric acid and volatile acidity; pH and fixed acidity; and free sulfur dioxide and total sulfur dioxide are highly correlated with each other.
To visualize and validate these correlations, let’s take a look at the scaled scatter plot of each predictor plotted against its correlated counterpart.
# scaled data sets
scaled_white = as.data.frame(scale(select(white, c(-quality,-type))))
scaled_red = as.data.frame(scale(select(red, c(-quality,-type))))
# scaled white residual sugar versus density
ggplot(scaled_white, aes(x = residual.sugar, y = density)) + geom_point()+scale_x_continuous(name = "Residual Sugar") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle(" Residual Sugar Versus Density") + theme(plot.title = element_text(size = 5))# scaled white alcohol versus density
ggplot(scaled_white, aes(x = alcohol, y = density)) + geom_point()+scale_x_continuous(name = "Alcohol") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Alcohol Versus Density") + theme(plot.title = element_text(size = 5))# scaled white free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_white, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 5))# scaled red volatile acidity versus citric acid
ggplot(scaled_red, aes(x = volatile.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Volatile Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Volatile Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 5))# scaled red fixed acidity versus citric acid
ggplot(scaled_red, aes(x = fixed.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 5))# scaled red fixed acidity versus pH
ggplot(scaled_red, aes(x = fixed.acidity, y = pH)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "pH") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus pH") + theme(plot.title = element_text(size = 5))# scaled red fixed acidity versus density
ggplot(scaled_red, aes(x = fixed.acidity, y = density)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Density") + theme(plot.title = element_text(size = 5))# scaled red free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_red, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Dioxide Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 5))Based on the scatter plots, we can visualize the correlations between the predictors. For example, for white wine, density has a strong positive correlation with residual sugar and a moderate negative correlation with alcohol. Through these scatter plots, we confirm the existence of correlations predicted by our initial correlation matrix.
Now, we can take a look at the box plots for several of our predictors to see the ways that they interact with wine quality, isolated from the other predictors. First, we will visualize acidity levels which can be measured through fixed acidity, volatile acidity, and citric acid levels. As shown above in the scatter plots, these three predictors are highly correlated with each other in red wine.
#fixed acidity
ggplot(combinedWine, mapping = aes(x = `fixed.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Fixed Acidity Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) +coord_cartesian( xlim = c(0,16), ylim = NULL, default = FALSE )# volatile acidity
ggplot(combinedWine, mapping = aes(x = `volatile.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Volatile Acidity Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )# citric acid
ggplot(combinedWine, mapping = aes(x = `citric.acid`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Citric Acid Levels Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )From these box plots, we can see that fixed acidity levels are relatively consistent in both red and white wine. Volatile acidity has a negative correlation with quality in red wine, but relatively consistent averages for each level of wine quality in white wine. Citric acid levels in red wine have a stronger positive correlation than in white wine. In general, we can see that acidity levels fluctuate more in red wine than in white wine.
Next, let’s take a look at dryness which is determined by the predictor density. Based on the correlation matrix and scatter plots, density also is correlated with residual sugar and alcohol in white wine and with fixed acidity in red wine.
# density
ggplot(combinedWine, mapping = aes(x = density, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = density, y = quality)) + labs(title = "Red and White Density Levels versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(.985,1.01), ylim = NULL, default = FALSE )Although density is correlated with several predictors according to the correlation matrices and scatter plots, in this box plot, we can see that density stays relatively consistent, around 1, for each level of wine quality.
Next, we will look at the taste of the wine, which is determined by levels of sweetness and saltiness. These are affected by sugar levels and chlorides respectively.
#sugar content
ggplot(combinedWine, mapping = aes(x = residual.sugar, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = residual.sugar, y = quality)) + labs(title = "Red and White Residual Sugar Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,25), ylim = NULL, default = FALSE )# we removed outliers to ensure that the variation was not due to the outliers
ggplot(red[red$free.sulfur.dioxide < 50,], aes(x = free.sulfur.dioxide, y = quality)) +
geom_boxplot(aes(fill = quality)) +
labs(title = "Free Sulfur Dioxide for Red Wine", x = "Free Sulfur Dioxide", y = "Quality") +
geom_point(width = 0.15) +
scale_fill_brewer(palette = "RdPu")#chlorides
ggplot(combinedWine, mapping = aes(x = chlorides, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = chlorides, y = quality)) + labs(title = "Red and White Chloride Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,0.3), ylim = NULL, default = FALSE )White wine has, on average, higher and more variable sugar levels than red wine while red wine has an on average higher chloride content than white wine. There seem to be a higher number of outliers in the values of chloride.
Next, we will analyze alcohol content, which can affect the taste of the wine as well.
ggplot(combinedWine, mapping = aes(x = alcohol, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Alcohol Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) There is generally higher alcohol content associated with a wine of higher quality but there is not a significant different in averages between red wine and white wine.
Lastly, let’s look at the preservative content, which is determined by free sulfur dioxide, total sulfur dioxide, and sulfates.
#free sulfur dioxide
ggplot(combinedWine, mapping = aes(x = free.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = free.sulfur.dioxide, y = quality)) + labs(title = "Red and White Free Sulfur Dioxide Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,150), ylim = NULL, default = FALSE )#total sulfur dioxide
ggplot(combinedWine, mapping = aes(x = total.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = total.sulfur.dioxide, y = quality)) + labs(title = "Red and White Total Sulfur Dioxide Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,300), ylim = NULL, default = FALSE )#sulfates
ggplot(combinedWine, mapping = aes(x = sulphates, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = sulphates, y = quality)) + labs(title = "Red and White Sulfate Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1.5), ylim = NULL, default = FALSE )From the box plots we can see that red wine generally has a lower sulfur dioxide content than white wine. Also, averages across each stratification of quality have similar values except for sulfates in red wine, which have a slight positive correlation with quality.
CONCLUDING SENTENCE; CONNECT TO HOW KNOWING THESE THINGS AFFECTS OUR MODELLING
We will be fitting linear discriminant analysis, naive Bayes, single decision tree, random forest, and boosted tree models and compare accuracy metrics. Then, we will fit the three models with the best roc_auc to our testing data. First, let’s see how the models perform on the white wine data set.
white_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())
red_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())We did all of our modeling in R-Scripts for efficiency purposes, since models generally take a long time to run. We will load all of results here, and intermittently call variables throughout the report to visualize our calculations.
# LDA
load("WhiteLDA.rda")
load("RedLDA.rda")
# LDA with PCA
load("WhiteLDAPCA.rda")
load("RedLDAPCA.rda")
# Decision Tree
load("WhiteWineDecisionTree.rda")
load("RedWineDecisionTree.rda")
# Random Forest
load("WhiteWineRandomForest.rda")
load("RedWineRandomForest.rda")
# Boosted Trees
load("WhiteWineBoostedTrees.rda")
load("RedWineBoostedTrees.rda")Let’s first explore linear discriminant analysis and naive Bayes classification through k-fold cross validation.
#set up model with mode classification and engine MASS
wlda_model <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
#add model and recipe to the workflow
wlda_wkflow<- workflow() %>%
add_model(wlda_model) %>%
add_recipe(white_recipe)
#create a fit between the workflow and folded data
wlda_fit_cross <- fit_resamples(wlda_wkflow, white_fold)
#determine the roc_auc of the LDA model on the folded training data
collect_metrics(wlda_fit_cross)## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.528 5 0.00640 Preprocessor1_Model1
## 2 roc_auc hand_till 0.735 5 0.0159 Preprocessor1_Model1
TRANSITION FROM LDA TO BAYES
#set up model with mode classification and engine kLaR
#we used set_args(use_kernel = FALSE) based on Lab 3
wnb_mod <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
#add model and recipe to the workflow
wnb_wkflow <- workflow() %>%
add_model(wnb_mod) %>%
add_recipe(white_recipe)
#create a fit between the workflow and folded data
wnb_fit_cross <- fit_resamples(wnb_wkflow, white_fold)
#determine the roc_auc of the Naive Bayes model on the folded training data
collect_metrics(wnb_fit_cross)## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.451 5 0.00984 Preprocessor1_Model1
## 2 roc_auc hand_till 0.734 5 0.0200 Preprocessor1_Model1
Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model. Thus, we will use linear discriminant analysis on the white wine data set. Now, let’s explore how the model performs on our testing data.
add pca lda here
#fit the workflow to the training data
wlda_fit <- fit(wlda_wkflow, white_train)
#predict on the testing data, selecting for quality as the classification response
wlda_res <- predict(wlda_fit, new_data = white_test %>% select(-quality), type = "class" )
#add a column for the actual value of quality next to the predicted value
wlda_res <- bind_cols(wlda_res, white_test %>% select(quality)) # returning the accuracy, roc_auc, roc_auc curves, heatmap
wlda_pred <- augment(wlda_fit, new_data = white_test)
wlda_acc <- wlda_pred %>% accuracy(truth = quality, estimate = .pred_class)
wlda_rocauc <- wlda_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White LDA Model")
wlda_roccurve <- wlda_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wlda_confusionmatrix <- wlda_pred %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
print(wlda_acc)
print(wlda_rocauc)
wlda_roccurve
wlda_confusionmatrixThrough the linear discriminant analysis model, we get an accuracy metric of .537 meaning that our model is moderately accurate. Now, let’s try several tree methods to see if the produce more accurate results on the training data set of white wine. We will first look at the model for a single decision tree.
# decision tree specification
wtree_spec <- decision_tree() %>%
set_engine("rpart")
wtree_spec_class <- wtree_spec %>%
set_mode("classification")wclass_tree_fit <- wtree_spec_class %>%
fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)wclass_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()# augmented on training
augment(wclass_tree_fit, new_data = white_train) %>%
accuracy(truth = quality, estimate = .pred_class)
augment(wclass_tree_fit, new_data = white_train) %>%
conf_mat(truth = quality, estimate = .pred_class)# tuning cost complexity
wclass_tree_wf<- workflow() %>%
add_model(wtree_spec_class %>%
set_args(cost_complexity = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)
tune_res_white <- tune_grid(
wclass_tree_wf,
resamples = white_fold,
grid = param_grid,
metric = metric_set(accuracy)
)#autoplot(tune_res_white)
wAutoPlot# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_white)
wclass_tree_final <- finalize_workflow(wclass_tree_wf, best_complexity)
wclass_tree_final_fit <- fit(wclass_tree_final, data = white_train)wclass_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()# augmented on training
wdectree_pred <- augment(wclass_tree_final_fit, new_data = white_train)
wdectree_acc <- wdectree_pred %>% accuracy(truth = quality, estimate = .pred_class)
wdectree_rocauc <- wdectree_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Decision Tree Model")
wdectree_roccurve <- wdectree_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wdectree_confusionmatrix <- augment(wclass_tree_final_fit, new_data = white_train) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")print(wdectree_acc)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.647
print(wdectree_rocauc)## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.818 White Decision Tree Model
wdectree_roccurvewdectree_confusionmatrix# JUST LOAD IN INFO WITH THIS RATHER THAN RERUNNING
load(file = "WhiteWineDecisionTree.rda")decision tree analysis
Now let’s look at the random forest model.
#load saved data
load(file = "WhiteWineRandomForest.rda")
# setting random forest model up
wrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
wrandfor_wf <- workflow() %>%
add_model(wrandfor %>%
set_args(mtry = tune(), trees = tune(), min_n = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
# tuning the model to find the best arguments
param_grid1 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)
wtune_res_randfor <- tune_grid(
wrandfor_wf,
resamples = white_fold,
grid = param_grid1,
metric = metric_set(accuracy)
)
wAutoPlotRF <- autoplot(wtune_res_randfor)
# collecting metrics to find best mean
wbest_rocauc1 <- collect_metrics(wtune_res_randfor) %>% arrange(desc(mean))
wbest_metric1 <- select_best(wtune_res_randfor)
wrandfor_final <- rand_forest(mtry = 2, trees = 17, min_n = 30) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
wrandfor_fit_final <- fit(wrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)
# variance importance plot
wVIP <- vip(wrandfor_fit_final)
wVIP
# extracting the metrics
wrandfor_pred <- augment(wrandfor_fit_final, new_data = white_train)
wrandfor_acc <- wrandfor_pred %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc <- wrandfor_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve <- wrandfor_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix <- augment(wrandfor_fit_final, new_data = white_train) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
print(wrandfor_acc)
print(wrandfor_rocauc)
wrandfor_roccurve
wrandfor_confusionmatrixrandom forest analysis
Now, let’s look at the boosted tree model.
#load saved data
load(file = "WhiteWineBoostedTrees.rda")
wboost_spec <- boost_tree(tree_depth = 5) %>%
set_engine("xgboost") %>%
set_mode("classification")
wboost_wf <- workflow() %>%
add_model(wboost_spec %>%
set_args(trees = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid3 <- grid_regular(trees(range = c(10,2000)), levels = 10)
wtune_res_boosted <- tune_grid(
wboost_wf,
resamples = white_fold,
grid = param_grid3
)
wBoostedAutoPlot <- autoplot(wtune_res_boosted)
wBoostedAutoPlot
wbest_rocauc2 <- collect_metrics(wtune_res_boosted) %>% arrange(desc(mean))
wbest_metric2 <- select_best(wtune_res_boosted)
print(wbest_metric2)
wboost_final <- boost_tree(tree_depth = 5, trees = 231)%>%
set_engine("xgboost") %>%
set_mode("classification")
wboost_fit_final <- fit(wboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)
# augmenting
wpredicted <- augment(wboost_fit_final, new_data = white_train)
wboosted_acc <- wpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc <- wpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurve <- wpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrix <- wpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
wpredicted
wboosted_acc
wboosted_rocauc
wBoostedROCCurve
wBoostedConfusionMatrix# augment all the rocs from decision, randforest, boosted --> see which one has the best --> fit best on testing
# WHITE
wbest_rocauc
wbest_rocauc1
wbest_rocauc2
wbest_roc_table <- rbind(wbest_rocauc[1,c(2,4)], wbest_rocauc1[1,c(4,6)], wbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
wbest_roc_table
# We will test our model using Boosted Trees and Random Forest explain the best ones of the tree type models also mention between lda and naive bayes, lda was better, but we are gonna try using pca first
white_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())
# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(white_recipe_pca$template))
# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)
wlda_pca_wkflow <- workflow() %>%
add_model(wlda_model) %>%
add_recipe(white_recipe_pca)
pca_lda_fit <- wlda_pca_wkflow %>%
tune_grid(resamples = white_fold,
grid = tuneGrid,
metrics = metric_set(accuracy),
control = trControl)
pca_lda_metrics <- pca_lda_fit %>% collect_metrics()
ggplot(pca_lda_metrics, aes(x = num_comp, y = mean)) +
geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
scale_x_continuous(breaks = 1:ncol(white_recipe_pca$template)) +
facet_wrap(~.metric) +
theme_bw()
pca_lda_fit %>% show_best(metric = "accuracy")
(bestTune <- pca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))
wlda_pca_wkflow_final <- wlda_pca_wkflow %>% finalize_workflow(bestTune)
fit_final <- wlda_pca_wkflow_final %>% fit(white_train)
white.PCALDA <- tibble(white_train,
predict(fit_final, new_data = white_train, type = "class"), # predicted class
predict(fit_final, new_data = white_train, type = "prob"), # posterior prob. for classes
as_tibble(predict(fit_final, new_data = white_train, type = "raw")$x)) # LD scores
white.PCALDA
# plot
ggplot(white.PCALDA, aes(x = LD1, y = LD2)) +
geom_point(aes(color = quality, shape = .pred_class)) +
theme_bw() +
ggtitle("PCA-LDA (DAPC) on White Wine Training dataset, using 9 PC")
#augmented on training
pcalda_fit <- augment(fit_final, new_data = white_train)
pcalda_acc <- pcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_rocauc <- pcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
pcalda_roccurve <- pcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_confusionmatrix <- pcalda_fit %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
pcalda_acc
pcalda_rocauc
pcalda_roccurve
pcalda_confusionmatrix# lda using pca testing
pcalda_fit_test <- augment(fit_final, new_data = white_test)
pcalda_test_acc <- pcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_test_rocauc <- pcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA")
pcalda_test_roccurve <- pcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_test_confusionmatrix <- pcalda_fit_test %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
pcalda_test_acc
pcalda_test_rocauc
pcalda_test_roccurve
pcalda_test_confusionmatrix
# random forest testing
wrandfor_pred_test <- augment(wrandfor_fit_final, new_data = white_test)
wrandfor_acc_test <- wrandfor_pred_test %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc_test <- wrandfor_pred_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve_test <- wrandfor_pred_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix_test <- augment(wrandfor_fit_final, new_data = white_test) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
print(wrandfor_acc_test)
print(wrandfor_rocauc_test)
wrandfor_roccurve_test
wrandfor_confusionmatrix_test
# boosted trees testing
wpredictedtest <- augment(wboost_fit_final, new_data = white_test)
wboosted_acc_test <- wpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc_test <- wpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurveTesting <- wpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrixTesting <- wpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
wpredictedtest
wboosted_acc_test
wboosted_rocauc_test
wBoostedROCCurveTesting
wBoostedConfusionMatrixTestingwhite conclusion
Next, let’s see how the models perform on the red wine data set. We will be using the same type of models that we used in white wine in order to keep things consistent.
#lda model using cross validation
rlda_model <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
rlda_wkflow<- workflow() %>%
add_model(rlda_model) %>%
add_recipe(red_recipe)
rlda_fit_cross <- fit_resamples(rlda_wkflow, red_fold)
collect_metrics(rlda_fit_cross)
#naive bayes model using cross validation
rnb_mod <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
rnb_wkflow <- workflow() %>%
add_model(rnb_mod) %>%
add_recipe(red_recipe)
rnb_fit_cross <- fit_resamples(rnb_wkflow, red_fold)
collect_metrics(rnb_fit_cross)Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model. Thus, we will also use linear discriminant analysis on the red wine data set.
# fitting the model
rlda_model <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
# creating the workflow
rlda_wkflow <- workflow() %>%
add_model(rlda_model) %>%
add_recipe(red_recipe)
# fitting the model to training data
rlda_fit <- fit(rlda_wkflow, red_train)
# now using that fit to test the model on the testing data
rlda_res <- predict(rlda_fit, new_data = red_test %>% select(-quality), type = "class" )
rlda_res <- bind_cols(rlda_res, red_test %>% select(quality))
# returning our predictions and visualizations
rlda_pred <- augment(rlda_fit, new_data = red_test)
rlda_acc <- rlda_pred %>% accuracy(truth = quality, estimate = .pred_class)
rlda_rocauc <- rlda_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8)%>% mutate(model_type = "Red LDA Model")
rlda_roccurve <- rlda_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rlda_confusionmatrix <- rlda_pred %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
# predictions, accuracy, roc auc, roc curve, confusion matrix
print(rlda_acc)
print(rlda_res)
print(rlda_rocauc)
rlda_roccurve
rlda_confusionmatrixdecision tree transition
# decision tree specification
rtree_spec <- decision_tree() %>%
set_engine("rpart")
# setting mode to classification
rtree_spec_class <- rtree_spec %>%
set_mode("classification")
rclass_tree_fit <- rtree_spec_class %>%
fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)
rclass_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
# augmented on training
augment(rclass_tree_fit, new_data = red_train) %>%
accuracy(truth = quality, estimate = .pred_class)
augment(rclass_tree_fit, new_data = red_train) %>%
conf_mat(truth = quality, estimate = .pred_class)
# augmented on testing
augment(rclass_tree_fit, new_data = red_train) %>%
conf_mat(truth = quality, estimate = .pred_class)
augment(rclass_tree_fit, new_data = red_train) %>%
accuracy(truth = quality, estimate = .pred_class)
# tuning cost complexity
rclass_tree_wf<- workflow() %>%
add_model(rtree_spec_class %>%
set_args(cost_complexity = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)
tune_res_red <- tune_grid(
rclass_tree_wf,
resamples = red_fold,
grid = param_grid,
metric = metric_set(accuracy)
)
autoplot(tune_res_red)
# extracting the best cost complexitiy parameter
best_complexity <- select_best(tune_res_red)
rclass_tree_final <- finalize_workflow(rclass_tree_wf, best_complexity)
rclass_tree_final_fit <- fit(rclass_tree_final, data = red_train)
rclass_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
# augmented on training
augment(rclass_tree_final_fit, new_data = red_train) %>%
accuracy(truth = quality, estimate = .pred_class)
augment(rclass_tree_final_fit, new_data = red_train) %>%
conf_mat(truth = quality, estimate = .pred_class)
# augmented on testing
augment(rclass_tree_final_fit, new_data = red_test) %>%
conf_mat(truth = quality, estimate = .pred_class)
augment(rclass_tree_final_fit, new_data = red_test) %>%
accuracy(truth = quality, estimate = .pred_class)random forest transition
# setting random forest model up
rrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_wf <- workflow() %>%
add_model(rrandfor %>%
set_args(mtry = tune(), trees = tune(), min_n = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
# tuning the model to find the best arguments
param_grid2 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)
rtune_res_randfor <- tune_grid(
rrandfor_wf,
resamples = red_fold,
grid = param_grid2
)
rAutoPlotRF <- autoplot(rtune_res_randfor)
# collecting metrics to find best mean
rbest_rocauc1 <- collect_metrics(rtune_res_randfor) %>% arrange(desc(mean))
print(rbest_rocauc1)
rbest_metric1 <- select_best(rtune_res_randfor)
rrandfor_final <- rand_forest(mtry = 7, trees = 17, min_n = 32) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_fit_final <- fit(rrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,data = red_train)
rVIP <- vip(rrandfor_fit_final)
rVIP
# saving
save(rAutoPlotRF, rbest_rocauc1, rrandfor_final, rVIP, rrandfor_fit_final, file = "WhiteWineRandomForest.rda")
#load saved data
#load()boosted tree transition
# Red Boosted Trees
rboost_spec <- boost_tree(tree_depth = 5) %>%
set_engine("xgboost") %>%
set_mode("classification")
rboost_wf <- workflow() %>%
add_model(rboost_spec %>%
set_args(trees = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid4 <- grid_regular(trees(range = c(10,2000)), levels = 10)
rtune_res_boosted <- tune_grid(
rboost_wf,
resamples = red_fold,
grid = param_grid4
)
rBoostedAutoPlot <- autoplot(rtune_res_boosted)
rBoostedAutoPlot
rbest_rocauc2 <- collect_metrics(rtune_res_boosted) %>% arrange(desc(mean))
print(rbest_rocauc2)
rbest_metric2 <- select_best(rtune_res_boosted)
print(rbest_metric2)
rboost_final <- boost_tree(tree_depth = 5, trees = 231)%>%
set_engine("xgboost") %>%
set_mode("classification")
rboost_fit_final <- fit(rboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)
rpredicted <- augment(rboost_fit_final, new_data = red_train)
rboosted_acc <- rpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc <- rpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurve <- rpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrix <- rpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rpredictedtest <- augment(rboost_fit_final, new_data = red_test)
rboosted_acc_test <- rpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc_test <- rpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurveTesting <- rpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrixTesting <- rpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")best fit transition
#RED
rbest_rocauc
rbest_rocauc1
rbest_rocauc2
rbest_roc_table <- rbind(rbest_rocauc[1,c(2,4)], rbest_rocauc1[1,c(4,6)], rbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
rbest_roc_table
# We will test our model using Boosted Trees and Random Forest red_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())
# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(red_recipe_pca$template))
# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)
rlda_pca_wkflow <- workflow() %>%
add_model(rlda_model) %>%
add_recipe(red_recipe_pca)
rpca_lda_fit <- rlda_pca_wkflow %>%
tune_grid(resamples = red_fold,
grid = tuneGrid,
metrics = metric_set(accuracy),
control = trControl)
rpca_lda_metrics <- rpca_lda_fit %>% collect_metrics()
ggplot(rpca_lda_metrics, aes(x = num_comp, y = mean)) +
geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
scale_x_continuous(breaks = 1:ncol(red_recipe_pca$template)) +
facet_wrap(~.metric) +
theme_bw()
rpca_lda_fit %>% show_best(metric = "accuracy")
(bestTune <- rpca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))
rlda_pca_wkflow_final <- rlda_pca_wkflow %>% finalize_workflow(bestTune)
rfit_final <- rlda_pca_wkflow_final %>% fit(red_train)
red.PCALDA <- tibble(red_train,
predict(rfit_final, new_data =red_train, type = "class"), # predicted class
predict(rfit_final, new_data = red_train, type = "prob"), # posterior prob. for classes
as_tibble(predict(rfit_final, new_data = red_train, type = "raw")$x)) # LD scores
# plot
ggplot(red.PCALDA, aes(x = LD1, y = LD2)) +
geom_point(aes(color = quality, shape = .pred_class)) +
theme_bw() +
ggtitle("PCA-LDA (DAPC) on Red Wine Training dataset, using 9 PC")
#augmented on training
rpcalda_fit <- augment(rfit_final, new_data = red_train)
rpcalda_acc <- rpcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_rocauc <- rpcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
rpcalda_roccurve <- rpcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_confusionmatrix <- rpcalda_fit %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rpcalda_acc
rpcalda_rocauc
rpcalda_roccurve
rpcalda_confusionmatrixfinal testing transition
# lda using pca test
rpcalda_fit_test <- augment(rfit_final, new_data = red_test)
rpcalda_test_acc <- rpcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_test_rocauc <- rpcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "Red Wine LDA Model using PCA")
rpcalda_test_roccurve <- rpcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_test_confusionmatrix <- rpcalda_fit_test %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rpcalda_test_acc
rpcalda_test_rocauc
rpcalda_test_roccurve
rpcalda_test_confusionmatrixred conclusion